TripFS: Exposing File Systems as Linked Data
Abstract
File systems are highly interesting sources of information since large amounts of digital information are stored using plain file hierarchies. However, the question of how file system data can be integrated into the Web of Data has not yet been sufficiently addressed. In this paper we give a short overview of TripFS, a Java-based server software that extracts RDF descriptions from a file system, links file resources to other relevant data sources, and exposes these data sets according to Linked Open Data principles.

In many application contexts, hierarchical file systems are an important means of data storage for unstructured or heavily heterogeneous content. Files are used in a wide variety of applications, ranging from personal data on typical desktops, through shared folders in an enterprise, to data stored on a web server and published worldwide. As these examples show, files are highly generic and universally usable.

Files are typically accessed either by locating them via their absolute file path (usually by traversing a directory hierarchy) or by using a full-text search engine that keeps an index over file contents. Both approaches are well supported by common operating systems. However, there exist no platform-independent mechanisms that allow file system contents to be integrated into a larger information network: files cannot be linked to other information objects, and their metadata descriptions cannot be processed in a platform- and operating-system-independent manner.

Because of these deficiencies, applying Linked Open Data principles to file systems is a natural step. TripFS, presented in this paper, is a tool that represents directories and files as RDF resources, extracts metadata and creates links to other data sets, serves these metadata as RDF or HTML using content negotiation, and provides a SPARQL endpoint that allows clients to execute queries over the entire file system.
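As a rough illustration of the two access paths just described (content negotiation and the SPARQL endpoint), the following client-side sketch builds the corresponding HTTP requests without sending them. It assumes Java 11 or later. The class name, the helper names, the `/resource/example` path, and the choice of `text/turtle` as media type are assumptions made for this example and are not taken from TripFS itself; only the demo host and endpoint URL come from the paper, and that instance may no longer be online.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

public class TripFsClientSketch {
    // Demo endpoint named in the paper; it may no longer be reachable.
    static final String SPARQL_ENDPOINT = "http://demo.mminf.univie.ac.at:9876/sparql";

    /** Builds a GET request that asks for one representation via the Accept header. */
    static HttpRequest negotiate(String resourceUri, String mediaType) {
        return HttpRequest.newBuilder(URI.create(resourceUri))
                .header("Accept", mediaType)
                .GET()
                .build();
    }

    /** Encodes a SPARQL query as the "query" parameter of a GET request URI. */
    static URI sparqlGet(String endpoint, String query) {
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return URI.create(endpoint + "?query=" + encoded);
    }

    public static void main(String[] args) {
        // Hypothetical resource URI; the real TripFS URI scheme is not described here.
        HttpRequest request = negotiate(
                "http://demo.mminf.univie.ac.at:9876/resource/example", "text/turtle");
        System.out.println(request.method() + " " + request.uri());
        System.out.println("Accept: " + request.headers().firstValue("Accept").orElse(""));

        // A generic query that lists a few subjects; no TripFS vocabulary is assumed.
        System.out.println(sparqlGet(SPARQL_ENDPOINT, "SELECT ?s WHERE { ?s ?p ?o } LIMIT 5"));
    }
}
```

The same URI can be requested with `text/html` instead of `text/turtle` to ask for the HTML rendition; the requests built here could be sent with `java.net.http.HttpClient`.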
TripFS is entirely implemented in Java and consists of six main components:

Store. TripFS is implemented using the Jena Semantic Web Framework and abstracts over a concrete RDF storage. It has been successfully tested with an in-memory storage as well as on top of a PostgreSQL relational database.

Crawler. On startup, TripFS crawls the file system starting from a given root directory. All files and directories are scheduled for subsequent extraction and interlinking. Each file is assigned a globally unique, dereferenceable HTTP URI, which is independent of the file path and remains intact when the file is modified or moved.

Watcher. TripFS tracks changes in the file system (creation, deletion, and modification of files); after a change, the affected files are re-scheduled for metadata extraction and interlinking. Hence the LOD representation stays in sync with the actual file system.

Extractors. TripFS provides a modular framework for metadata extraction. It re-uses components from the Aperture framework, which provides extractors for a number of popular file formats (including Office documents and multimedia data like images and audio files). The framework extracts data depending on the file format (e.g., title and artist from music files, width and height from images, and so forth) and stores them in the RDF store. Extracted data mostly adheres to the NEPOMUK Semantic Desktop ontologies.

Linkers. Similar to extraction, a pluggable set of linkers can be instantiated within the TripFS server. After crawling and extraction, files are scheduled for linking. Currently we have implemented two experimental linker components: one that links music files (e.g., in the MP3 format) to MusicBrainz using artist name, track name, and duration, and one that links paginated documents (e.g., PDF or MS Word files) to ACM publication records based on the document title. These linkers generate owl:sameAs and rdfs:seeAlso links; a detailed analysis of the linking quality is subject to further research. Additional linkers and extractors can easily be added to the TripFS system by implementing the corresponding interfaces.

Web Server. RDF descriptions of files are served according to Linked Open Data principles as RDF (RDF/XML, Turtle, N3) and HTML (see Fig. 1), which the client can select using content negotiation. The HTML rendition is enriched with embedded RDFa descriptions and file-system-specific graphic elements (e.g., file icons). Additionally, the server provides a standards-compliant SPARQL endpoint.

Figure 1: TripFS HTML Rendition

1 A running demo instance can be accessed via http://demo.mminf.univie.ac.at:9876; the corresponding SPARQL endpoint is available at http://demo.mminf.univie.ac.at:9876/sparql.
2 http://aperture.sourceforge.net
3 http://www.semanticdesktop.org/ontologies
4 http://wiki.musicbrainz.org/RDF
5 http://acm.rkbexplorer.com
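The Crawler and Watcher descriptions above imply a mapping from file paths to URIs that survives file moves. The following is a minimal sketch of one way such a mapping could work, assuming the URI is minted from a random UUID on first sight and that move events are reported explicitly. The class `StableUriRegistry`, its methods, and the base URI are hypothetical names for this example; the text does not describe how TripFS actually stores this mapping or detects moves.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

public class StableUriRegistry {
    private final String baseUri;
    // Current location of each known file, mapped to its path-independent URI.
    private final Map<Path, String> uriByPath = new HashMap<>();

    public StableUriRegistry(String baseUri) {
        this.baseUri = baseUri;
    }

    /** Crawler side: mint a URI the first time a path is seen, reuse it afterwards. */
    public String register(Path path) {
        return uriByPath.computeIfAbsent(path.normalize(), p -> baseUri + UUID.randomUUID());
    }

    /** Watcher side: on a move, the existing URI follows the file to its new path. */
    public boolean move(Path from, Path to) {
        String uri = uriByPath.remove(from.normalize());
        if (uri == null) {
            return false; // unknown source path, nothing to move
        }
        uriByPath.put(to.normalize(), uri);
        return true;
    }

    /** Returns the URI currently associated with a path, if any. */
    public Optional<String> lookup(Path path) {
        return Optional.ofNullable(uriByPath.get(path.normalize()));
    }

    public static void main(String[] args) {
        // Hypothetical base URI and paths, chosen only for the demonstration.
        StableUriRegistry registry = new StableUriRegistry("http://localhost:9876/resource/");
        Path oldPath = Paths.get("/data/music/song.mp3");
        Path newPath = Paths.get("/data/archive/song.mp3");

        String uri = registry.register(oldPath);
        registry.move(oldPath, newPath);

        // The URI minted for the old path is now found under the new path.
        System.out.println(uri.equals(registry.lookup(newPath).orElse(null)));
    }
}
```

Because the URI is derived from a random identifier rather than from the path, links that other data sets hold to the file do not break when the file is renamed or relocated. A persistent store would be needed in practice so that the mapping survives a server restart.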
Similar resources
Lifting File Systems into the Linked Data Cloud with TripFS
A major fraction of digital information is stored in file systems. File systems organize files usually in labelled directory trees and provide a minimum support for user-driven file annotation, linkage and categorization. Although file systems play a major role in knowledge organization, both in enterprise contexts as well as in the personal information sphere, they have rarely been considered ...
Ad-hoc File Sharing Using Linked Data Technologies
A large fraction of our information, both in the professional and private domains, is stored in the form of files on our personal computers. When we collaborate with co-workers or meet with friends, mechanisms for sharing files and file annotations are frequently required. However, centralized file sharing infrastructures are often not available or complicated to set up, and approaches like pee...
Beyond file systems: understanding the nature of places where people store their data
This paper analyzes the I/O and network behavior of a large class of home, personal and enterprise applications. Through user studies and measurements, we find that users and application developers increasingly have to deal with a de facto distributed system of specialized storage containers/file systems, each exposing complex data structures, and each having different naming and metadata conve...
E2DR: Energy Efficient Data Replication in Data Grid
Data grids are an important branch of grid computing which provide mechanisms for the management of large volumes of distributed data. Energy efficiency has recently emerged as a hot topic in large distributed systems. The development of computing systems is traditionally focused on performance improvements driven by the demand of client's applications in scientific and business domai...
Exposing French Agronomic Resources as Linked Open Data
The advancements in empirical technologies have generated vast amounts of heterogeneous data. This situation has created a need to integrate the data to understand the system of interest in its entirety. Therefore, information systems play a crucial role in managing these data, enabling the biologists in the extraction of new knowledge. The plant bioinformatics node of the Institut Français de B...